Chris Pollett > Old Classses >
CS267

( Print View )

Student Corner:
  [Grades Sec1]

  [Submit Sec1]

  [
Lecture Notes]
  [Discussion Board]

Course Info:
  [Texts & Links]
  [Description]
  [Course Outcomes]
  [Outcomes Matrix]
  [Course Schedule]
  [Grading]
  [Requirements/HW/Quizzes]
  [Class Protocols]
  [Exam Info]
  [Regrades]
  [University Policies]
  [Announcements]

HW Assignments:
  [Hw1]  [Hw2]  [Hw3]
  [Hw4]  [Hw5]  [Quizzes]

Practice Exams:
  [Midterm]  [Final]

                           












HW#3 --- last modified February 06 2019 04:16:06..

Solution set.

Due date: Apr 11

Files to be submitted:
  Hw3.zip

Purpose:

Related Course Outcomes:

The main course outcomes covered by this assignment are:

LO3 -- Be able to explain where BM25, BM25F and divergence from randomness statistics come from.

LO5 -- Demonstrate with small examples how incremental index updates can be done with log merging.

LO6 -- Be able to evaluate search results by hand and using TREC eval software.

Specification:

This homework consists of three parts: An exercise part, a coding part, and an evaluation part. For the exercise part of the assignment I want you to do the following book problems or variants of book problems: Problem 4.1 but where the posting list for some term consists of 58 million postings each of which consume 8 bytes, Problem 4.3, Problem 4.4, Problem 5.7. Put your solutions to these problems in Problems.pdf which should be included in the Hw3.zip file you submit.

For the coding portion of the assignment I want you to extend the program you wrote for Hw2 so that now it takes as a possible third arguments bm25 and bm25f. You can either start with your own code or extend the Hw2 solution. If bm25 is selected as the ranking method, then it should calculate the bm25 scores of all the documents which contain the query terms and output the results in decreasing order of BM25 relevance. To do this it should you should you the techniques of the March 16 lecture, where binary heaps are used in the retrieval process. Be aware that the Standard PHP Library has PHP classes for min heaps and max heaps so don't reinvent the wheel/heap.

For this assignment your code will be run with a line like:

php search_program.php some_dir query ranking_method tokenization_method alpha

Here the parameters are like Hw2 except for the following. First, some_dir is no longer a folder of txt files, it is now a folder of HTML files. Your program should again be a Composer-based project using Yioop. In addition, to using Yioop for stemming/chargramming, I want you to use the seekquarry\yioop\library\processors\HtmlProcessor class to do the text extraction from the HTML documents. To do this create an instance of HtmlProcessor with max description set at 20000, the summarizer as self::CENTROID_SUMMARIZER, and everything else at its default value. Then call the process() method with the read in contents of an HTML file and a dummy url. What this will return is an associative array with a CrawlConstants::TITLE and a CrawlConstants::DESCRIPTION field among several other fields. Unless bm25f is selected as the ranking method, I want you to just use the CrawlConstants::DESCRIPTION as if it had been the plain text of the input. Notice the first several terms in the description field actually are the same as the title field. For bm25f, I want you to remove this duplication between the title and description fields. Another difference between Hw2 and Hw3 is we now half an alpha command line input. For all, ranking methods other than bm25f just ignore the value of alpha and don't complain if it is not present. BM25F is a linear combination of BM25 scores over different document fields. In our case, we image the document as have two fields: title and description. For a document `d` we calculate a `BM25` score for just the text in the title field of the document, `BM25_{\mbox{title}}(d)`, and a `BM25` score for just the text in the description field of the document, `BM25_{\mbox{desc}}(d)`. The BM25F score of `d` with respect to some query `q` is then
`\alpha \cdot BM25_{\mbox{title}}(d, q) + (1 - \alpha)\cdot BM25_{\mbox{desc}}(d, q)`
Here the value of `alpha` should come from the command line argument alpha and should be a value between 0 and 1. The above definition is for the case where we only have two fields, the paper linked to above gives the general case definition to many field and where there is no requirement the weights sum to 1. Note average document lengths, etc used in our above definition should now be average field lengths, etc.

For the evaluation part of the homework, I want you to experimentally try to determine the value of alpha which gives the best ranking. To do this pick your favorite commercial search engine and make a list of five topic queries. Enter these into the search engine and get the top 20 results for each query. Download these 100 documents and put them in a test folder (which you submit with the homework). For each query, determine as a human what you think the order of the twenty documents returned should have been. We'll use the notation `pi(i)` to denote the function `pi` which takes a document with id `i`, and returns it rank in results according to some ranking function. Call the human ordering you just did the best ordering, `pi_{best}`. Given an ordering of document `pi` of the documents, define
`\mbox{score}(pi) = sum_{i=1}^20 |pi(i) - pi_{best}(i)|`
A score of 0 would mean `pi` returned the same ordering as the best one. For each of the five queries for each increment 0.1 of `alpha` between 0 and 1, I want you to compute `\mbox{score}(pi_{BM25F_\alpha})`. I want you to plot the points you get in a graph and to compute an average best `alpha` value. Finally, I want you to write up a summary of your experiments in experiments.txt. This summary should contain at least a one paragraph conclusion on what you found out.

Point Breakdown

Part 1 (each problem 1pt - 1 fully correct, 0.5 any defect or incomplete, 0 didn't do or was wrong) 4pts
Code makes use of heaps to output rankings as described in class1pt
Code modified to handle HTML files as inputs as described1pt
BM25 extension of HW2 code works as described1pt
BM25F extension of HW2 code works as described1pt
Tests data sets created in test subfolder and experiments conducted as described1pt
Plots presented and at least one paragraph conclusion about results written in experiments.txt1pt
Total10pts